Brief Summary of Data
And Data Source
This is a personal project. The motivation for this project is to
learn new bioinformatics skills related to viral research and expand my
knowledge. This project analyzes CMV (Human betaherpesvirus 5) genomic
data related to congenital infections. Data analysis comprises finding
drug resistance mutations, CMV genotyping, and phylogenetic analysis of
samples.
Raw sequence data used in this project has been previously shared in
NCBI-SRA by the University of Glasgow (HCMV_Congenital_Collection) under
the BioProject Accession Number PRJEB48078 (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB48078).
This BioProject contains 3 Genomic DNA submitted by Salvatore Camiolo
(University of Glasgow, School of Infection & Immunity). Instrument
used was Illumina MiSeq in PRJEB48078. And researchers used hybrid
selection as a selection method.
All bash and R-scripts and python notebooks related to the current
project have been shared in the GitHub repository https://github.com/osmanmerdan/Congenital-CMV. Tool
parameters have been explained in detail in scripts files.
This project does not aim to find optimal best practices to
analyze CMV NGS data. For readers who are interested in a user-friendly,
fast, all-in-one solution to analyze the CMV genome, please check
GRACy(Camiolo et al. (2021)). This HTML
document was created using RMarkdown. Plots were created using the
ggplot package if otherwise are not indicated.
CMV Viral
Particle
Human cytomegalovirus formally known as Human Herpes Virus 5 (HHV5)
is a member of Herpesviridae family. HHV-6, HHV-7, and CMV are
classified as betaherpesviruses.
Complete HCMV particles have a diameter of 120-200 nm. HCMV
nucleocapsid encircles large (220- to 240-kb) linear double-stranded DNA
genome. Outside of nucleocapsid there is matrix and a surrounding
phospholipid-rich envelope.
The HCMV genome is the largest among herpesviruses. It consists of
more than 170 nonoverlapping ORFs (Open Reading Frames). Genome encodes
structural, regulatory and immune modulatory proteins.
CMV is one of most important pathogenes causing congenital infections
ultimately leading sensorineural hearing loss and neurodevelopmental
problems.
HCMV DNA enters the host cell in linear form. HCMV genome has an
unpaired base at each end which facilitates circularization of the
genome which enables rolling circle replication. This replication
mechanism creates genome length covalently linked copies of viral genome
which are called concatemer. Concatemers are later cleved.
HCMV genome has class E organization with two domains that can be
inverted relative to each other (Fig 1). That organization
yields equal amounts of four genomic isomers that can be isolated from
samples. These two domains are the long and short genome segments (L and
S). Each domain contains a unique central region (UL and US). Those
regions are flanked by repeated components that reside at either the
terminal regions of the genome (TRL and TRS) or intersection of L and S
segments (IRL and IRS). General organization of HCMV genome is
TRL-UL-IRL -IRS-US-TRS. Terminal end internal repeats shares a region of
couple of hundreds bps, called a sequence.
Quality Control Of
Trimmed and Filtered NGS Data
Paired end seqeunce data of 36 samples were downloaded from SRA using
sratoolkit (3.0.0) fastq-dump. Reads were timmed with Trimmomatic (Bolger, Lohse, and Usadel (2014)) (version
0.39). The minimum sequence length was set to 200 bp. After Trimmomatic
operations, FastQC (“FastQC”
(2015)) and MultiQC (Ewels et al.
(2016)) were performed to get summary statistics about
raw-filtered NGS data.
The below table shows some of the summary statistics.
HCMV
wild-type strain Merlin genome size is 235646 bp. To
achieve reliable results from the variant analysis, at least a 20 bp
coverage threshold has been set in present NGS data analysis project.
That means at least 20000 (200bp long) reads are needed
(red line in
the Plot 1). Some of the low unique read count samples are:
HCMV Merlin strain genome has a GC content of %57. All the samples had
some level of deviation from the expected %57, which several issues like
contamination, PCR bias, low genome coverage, and hypervariable regions
in the CMV genomes may have caused (Fig 2).
Mean quality scores were sufficient enough for downstream analysis
(Fig 3).
Alignment
Results
Reads were aligned to
Human herpesvirus five strain Merlin, complete
genome (GenBank Accession: AY446894.2) using BWA-MEM (
Heng Li (2013)) (Version: 0.7.17-r1188).
Alignment files were filtered and sorted using Samtools (
Danecek et al. (2021)) (Version: 1.16.1). Output
bam files were de-duplicated using Picard MarkDuplicates (
“Picard Toolkit” (2018)).
Some important summary results from de duplicated-bam alignment files
were presented below.
Samples with mean depth lower than 20 or coverage lower than 90% were
listed below.
ERR7018474 and ERR7018433 could cover 50% of the reference genome with
depth(X)≥1
(Plot 2). Five samples (ERR7018443, ERR7018474,
ERR7021883, ERR7029111, ERR7039821) had low coverage and low depth
across the genome
(Fig 4). A large chunk of the reference
genome had depth(X)<20 for these five samples
(Plot 2). They
were not further processed.
There were three regions not covered in any sample, in the beginning, at
the end, and around the 200000th position
(Fig 4). Those
regions correspond to TRL, TRS, IRS, and IRL regions in AY446894.2,
which are listed below.
However, there were indications of possible deletions around positions
6000 and 11000
(Plot 3). Record ERR7036363 had zero depth in
those two regions
(Plot 3). ERR7032218 showed signs of deletion
only around position 11000. ERR7024258 was included for
comparison.
ERR7036363 showed significant sequence variation in the RL6 gene, as
indicated by soft-clipped reads
(Fig 5). That’s why ER7036363
had zero reads mapped at corresponding positions. ERR7032218 has some
SNPs and a deletion in the RL6 gene
(Fig 5). ERR7024258 has
nucleotide content very similar to the Merlin strain at corresponding
parts
(Fig 5).
ERR7036363 and ERR7032218 showed significant sequence variation with
signs of possible deletion in regions corresponding to the genes RL11,
RL12, RL13, and UL1
(Fig 6). In addition, ERR7024258 has many
sequence variations compared to the Merlin strain in these gene regions
(Fig 6).
RL6, RL12, RL13, and UL1 genes are members of the RL11 gene family.
These regions are known for showing sequence variations between clinical
isolates. (Sijmons, Van Ranst, and Maes
(2014), Dolan et al. (2004))
In the
Fig4 positions between 180800 and 181300, there is a
uniform drop in the the number of reads mapped across the samples.
However, some samples had a second drop in the number of reads mapped
around position 186800. ERR7025502 and ERR7039855 had zero mapped reads
in the region neighboring position 187500
(Plot 4). ERR7025502,
ERR7039855 and ERR7019208 had zero reads mapped around position 181000
(Plot 4).
UL146 gene has been shown to be particularly variable.(
Dolan et al. (2004))
(Fig 7).
ERR7025502, ERR7039855 and ERR7019208 had soft clipped reads in UL146
region indicating large amount of sequence variation between samples and
reference Merlin strain
(Fig 7).
Another region showing sequence variation between different HCMV
clinical isolates is the UL139 region (
Qi et al.
(2006)). The alignment status of reads belonging to ERR7025502,
and ERR7039855 showed sequence variation in UL139-150A regions
(Fig
8).
In summary, CMV has a lot of regions that pose high levels of sequence
variation between clinical strains. Those regions could be rendered as
low-coverage, low-depth regions during alignment process.
Variants
Variants were called on deduplicated-bam files and filtered using
bcftools (version 1.16).(
H. Li (2011)).
Later SnpEff database for AY446894.2 was build following instructions
shared on SnpEff website. (
http://pcingola.github.io/SnpEff/se_faq/#how-to-building-an-ncbi-genome-genbank-file)
Variants in coding regions were annotated using SnpEff (
Cingolani, Platts, et al. (2012)). Annotated
variants were extracted with SnpSift (
Cingolani,
Patel, et al. (2012)). Number of different annotations were
summarized below table.
Note that beacuse the of overlapping genes
and regions below numbers are not exact representation of number of
variants. During the variant extraction process each variant extracted
seperate rows to enable quick downstream pprocess.
Some samples had a high number of non-synonymous SNPs at regions RL12
and RL13
(Plot 5, Plot 6). However, nearly half of the samples
had zero non-synonymous SNP in RL12 and RL13
(Plot 6). RL12 and
RL13 bind Fc portion of human IgG1 and IgG2.(
Cortese et al. (2012)) and they show significant
number of sequence variations between clinical strains. Manually
inspecting alignments files for the R12 and RL13 genes revealed that
some samples (e.g.ERR7018455) had no coverage in the region which lead
zero detected sequence variation
(Fig 9, Fig 10). Hypervariable
regions in HCMV genome leads to complicated alignments which can
significantly alter the sequence variation results.
A higher number of non-synonymous SNPs were observed in UL8, UL74, UL75,
and UL150 genes
(Plot 5, Plot 6). The products of these genes
were summarized in the following table.
UL74, and UL75 are involved in cell entry (
Ye et
al. (2020)). UL74 and UL75 codes for envelope glycoproteins O and
H, respectively. UL74 is related to cell tropism and immunomodulation
(
Sijmons, Van Ranst, and Maes (2014)).
UL74 region is also a hypervariable region. UL150 is involved in immune
regulation which can explain high number of non-synonymous SNPs.
Membrane glycoprotein 8 is the product of UL8. Each sample had 30 or
more non-synonymous SNPs in the UL8 gene
(Plot 6). UL8 is shown
to be able to reduce the production of pro-inflammatory cytokines
significantly. UL7 and UL8 share 195 amino acids (aa) long sequence,
which includes ~35 aa signal peptide and ~100 aa IgV-like domain (
Pérez-Carmona et al. (2018)). Previous studies
have shown that. HCMV strains has %79-%100 sequence identity and most of
the sequence variation in the UL8 gene has been observed in the stalk
region which is ~140 aa long. Similar results obtained from the variant
analysis
(Plot 7). 3’ end of UL8 coding region showed high
number of sequence variations
(Fig 11).
Drug Resistance
Mutations
Main drugs used to treat CMV infections are Ganciclovir, Foscarnet,
Cidofovir, Letermovir and Maribravir. UL54 (DNA polymerase) mutations
can alter Ganciclovir, Foscarnet and Cidofovir succeptibility. UL97
kinase mutations effect Ganciclovir and Maribavir succeptibility.
Maribavir and Letermovir resistance can arise because of the mutations
in UL27 or Terminase gene (UL56, UL89, UL51) respectively.
Annotated-VCF files were searched for the presence of known antiviral
resistance-related mutations using the Hrpesvirus Drug Resistance
Genotyping web server (
http://cmv-resistance.ucl.ac.uk/herpesdrg/). No
antimicrobial resistance mutation has been found in any sample. One
exciting feature is that most samples have Gln126Leu and Asp605Glu amino
acid polymorphisms in the UL97 translated sequence _(Plot8__. Below
table summarizes detected polymorphisms by the server. Asp605Glu
polymorphism previously shown to be more prevalant in infants with
primary HCMV infection compared with HCMV recurrence in transplant
recipients (
Sun et al. (2012)).
No previous literature record was found for Thr75Ala amino acid
variation in UL97 translated sequence. However, it was present in 31
samples
(Plot8). Other mutations in UL97 were already mentioned
in the literature and are frequently found in clinical samples. (
Lurain and Chou (2010)) Characterized
No previous literature record was found for Gln339Arg and Thr1122Ile
amino acid changes in UL54 translated sequence
(Plot 9). Number
of samples has two SNPs (HGVS_C notation c.3365C>T, c.3364A>G)
(Ref Pos 78558 G->A and Ref Pos 78559 T -> C) affecting codon
corresponding to amino acid at position 1122 in UL54 protein
(Fig 12
and Fig 13). Those two SNPs were pictured as two different amino
acid changes (Thr1122Ile and Thr1122Ala) in the annotated vcf files. But
combined effect of two SNPs resulted in a codon change from ACC to GTC
which actually caused Thr1122Val. There were also some samples with
neither of mutations (Thr1122Ile and Thr1122Ala).
Gln339Arg was found in five samples. Remaining mutations were
characterized by other researchers.(
Lurain and
Chou (2010))
There was no resistance related mutations in UL56, UL51, UL89, UL27
genes. Some missense variations were present in nearly all samples
(Plot 10).
Bolger, Anthony M., Marc Lohse, and Bjoern Usadel. 2014.
“Trimmomatic: A Flexible Trimmer for Illumina
Sequence Data.” Bioinformatics 30 (15): 2114–20.
https://doi.org/10.1093/bioinformatics/btu170.
Camiolo, Salvatore, Nicolás M Suárez, Antonia Chalka, Cristina
Venturini, Judith Breuer, and Andrew J Davison. 2021.
“GRACy: A Tool for Analysing Human
Cytomegalovirus Sequence Data.” Virus Evolution 7 (1):
veaa099.
https://doi.org/10.1093/ve/veaa099.
Cingolani, P., V. M. Patel, M. Coon, T. Nguyen, S. J. Land, D. M. Ruden,
and X. Lu. 2012. “Using Drosophila Melanogaster as a Model for
Genotoxic Chemical Mutational Studies with a New Program,
SnpSift.” Frontiers in Genetics 3.
Cingolani, P., A. Platts, M. Coon, T. Nguyen, L. Wang, S. J. Land, X.
Lu, and D. M. Ruden. 2012. “A Program for Annotating and
Predicting the Effects of Single Nucleotide Polymorphisms, SnpEff: SNPs
in the Genome of Drosophila Melanogaster Strain W1118; Iso-2;
Iso-3.” Fly 6 (2): 80–92.
Cortese, Mirko, Stefano Calò, Romina D’Aurizio, Anders Lilja, Nicola
Pacchiani, and Marcello Merola. 2012.
“Recombinant
Human Cytomegalovirus (HCMV)
RL13 Binds Human
Immunoglobulin G Fc.”
Edited by Michael Nevels.
PLoS ONE 7 (11): e50166.
https://doi.org/10.1371/journal.pone.0050166.
Danecek, Petr, James K Bonfield, Jennifer Liddle, John Marshall, Valeriu
Ohan, Martin O Pollard, Andrew Whitwham, et al. 2021.
“Twelve years of SAMtools and BCFtools.”
GigaScience 10 (2).
https://doi.org/10.1093/gigascience/giab008.
Dolan, Aidan, Charles Cunningham, Ralph D. Hector, Aycan F.
Hassan-Walker, Lydia Lee, Clare Addison, Derrick J. Dargan, et al. 2004.
“Genetic Content of Wild-Type Human Cytomegalovirus.”
Journal of General Virology 85 (5): 1301–12.
https://doi.org/10.1099/vir.0.79888-0.
Ewels, Philip, Måns Magnusson, Sverker Lundin, and Max Käller. 2016.
“MultiQC: Summarize Analysis Results for Multiple Tools and
Samples in a Single Report.” Bioinformatics 32 (19):
3047.
https://doi.org/10.1093/bioinformatics/btw354.
Li, H. 2011.
“A Statistical Framework for SNP
Calling, Mutation Discovery, Association Mapping and Population
Genetical Parameter Estimation from Sequencing Data.”
Bioinformatics 27 (21): 2987–93.
https://doi.org/10.1093/bioinformatics/btr509.
Li, Heng. 2013.
“Aligning Sequence Reads, Clone Sequences and
Assembly Contigs with BWA-MEM.” https://doi.org/10.48550/ARXIV.1303.3997.
Lurain, Nell S., and Sunwen Chou. 2010.
“Antiviral
Drug Resistance of Human
Cytomegalovirus.” Clinical Microbiology
Reviews 23 (4): 689–712.
https://doi.org/10.1128/CMR.00009-10.
Pérez-Carmona, Natàlia, Pablo Martínez-Vicente, Domènec Farré, Ildar
Gabaev, Martin Messerle, Pablo Engel, and Ana Angulo. 2018.
“A
Prominent Role of the Human
Cytomegalovirus UL8 Glycoprotein
in Restraining Proinflammatory
Cytokine Production by Myeloid
Cells at Late Times During
Infection.” Edited by Richard M. Longnecker.
Journal of Virology 92 (9): e02229–17.
https://doi.org/10.1128/JVI.02229-17.
Qi, Ying, Zhi-Qin Mao, Qiang Ruan, Rong He, Yan-Ping Ma, Zheng-Rong Sun,
Yao-Hua Ji, and Yujing Huang. 2006.
“Human Cytomegalovirus
(HCMV) UL139 Open Reading Frame:
Sequence Variants Are Clustered into Three Major
Genotypes.” Journal of Medical Virology 78 (4): 517–22.
https://doi.org/10.1002/jmv.20571.
Sijmons, Steven, Marc Van Ranst, and Piet Maes. 2014.
“Genomic and
Functional Characteristics of
Human Cytomegalovirus Revealed by
Next-Generation
Sequencing.” Viruses 6 (3): 1049–72.
https://doi.org/10.3390/v6031049.
Sun, J., G. Cao, L. Zhang, Y. Zhang, Z. Zhao, J. Hu, and Y. Ji. 2012.
“Human Cytomegalovirus (CMV)
UL97 D605E Mutation
Has a Higher Prevalence in
Infants With Primary
CMV Infection Compared
With Transplant Recipients
With CMV Recurrence.”
Transplantation Proceedings 44 (10): 3022–25.
https://doi.org/10.1016/j.transproceed.2012.06.069.
Ye, Lele, Yunyun Qian, Weijie Yu, Gangqiang Guo, Hong Wang, and
Xiangyang Xue. 2020.
“Functional Profile of
Human Cytomegalovirus Genes and
Their Associated Diseases:
A Review.” Frontiers in
Microbiology 11 (September): 2104.
https://doi.org/10.3389/fmicb.2020.02104.